Submitted by: Ng ao yang
-For this problem, we use only one output variable (price).
We use machine learning tools to predict the prices of houses in King County, and to find out which features have the most significance in affecting the price of a house.
The next few blocks of code import the necessary modules used later on.
import sklearn
import numpy as np
import pandas as pd
import matplotlib
import platform
message=" Versions "
print("*"*len(message))
print(message)
print("*"*len(message))
print("Scikit-learn version={}".format(sklearn.__version__))
print("Numpy version={}".format(np.__version__))
print("Pandas version={}".format(pd.__version__))
print("Matplotlib version={}".format(matplotlib.__version__))
print("Python version={}".format(platform.python_version()))
#importing numpy and pandas, seaborn
import numpy as np #linear algebra
import pandas as pd #datapreprocessing, CSV file I/O
import seaborn as sns #for plotting graphs
import matplotlib.pyplot as plt
Import the data into a DataFrame.
data = pd.read_csv('kc_house_data.csv')
data.head()
In this section, we analyse and explore the data, and play around with it, to familiarise ourselves with it and find out more about it.
data.info()
data.describe()
data.shape
c = data.corr(method="pearson")
c
top2 = c.nlargest(2, 'price').index
top2
top10 = c.nlargest(10,'price').index
top10
data.isnull().sum()
data.columns.values
data.dtypes
data.head()
sns.distplot(data['price'],kde=False,color='darkred',bins=40)
plt.xticks(rotation=45)
sns.countplot(x="bedrooms",data=data)
#tells me how many houses have each number of bedrooms
ax = sns.barplot(x="bedrooms", y="price", data=data)
data['sqft_above'].min()
data['sqft_above'].max()
data.head()
g = sns.pairplot(data, hue="grade")
plt.figure(figsize=(10,10))
ax = sns.regplot(x=data['sqft_living'], y=data['price'], marker="+")
ax = sns.jointplot(x=data['lat'].values, y=data['long'].values, height=5)
plt.figure(figsize=(10,10))
ax = sns.regplot(x=data['bedrooms'], y=data['price'], marker="+")
plt.figure(figsize=(10,10))
ax = sns.regplot(x=data['floors'], y=data['price'], marker="x",fit_reg=False)
data.head()
data = data.drop('id',axis=1)
data['date'] = pd.to_datetime(data['date'])
data.price = data.price.astype(int)
data.bathrooms = data.bathrooms.astype(int)
data.floors = data.floors.astype(int)
data.head()
data["house_age"] = data["date"].dt.year - data['yr_built']
data=data.drop('date', axis=1)
data=data.drop('yr_renovated', axis=1)
data=data.drop('yr_built', axis=1)
data.info()
data.head()
data=data.drop('zipcode', axis=1)
data.describe()
data = data.drop('sqft_above',axis=1)
data = data.drop('sqft_basement',axis=1)
data.head()
f = 6.450000e+05
filt = (data['price'] <= f) & (data['bedrooms'] <= 11)
data = data.loc[filt]
data.bedrooms.unique()
data.bathrooms.unique()
data.apply(pd.Series.nunique)
data.bathrooms.unique()
data.bedrooms.unique()
data['price'] = np.log(data['price'])
data.head()
X = data
Y = X['price'].values
X = X.drop('price', axis = 1).values
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split (X, Y, test_size = 0.20, random_state=21)
data.head()
X_test
Y_test
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
reg = GradientBoostingRegressor()
kfold = KFold(n_splits=15, shuffle=True, random_state=21)
cv_results = cross_val_score(reg, X_train,Y_train, cv=kfold, scoring='r2')
print(cv_results)
round(np.mean(cv_results)*100, 2)
k_range = list(range(70,120))
print(k_range)
# create a parameter grid: map the parameter names to the values that should be searched
# value: list of values that should be searched for that parameter
# single key-value pair for param_grid
param_grid = dict(n_estimators=k_range)
print(param_grid)
# instantiate the grid
grid = GridSearchCV(reg, param_grid, cv=kfold, scoring='r2')
# fit the grid with data
grid.fit(X_train,Y_train)
#finding the best estimator for the model
best_fit = grid.best_estimator_
best_fit
reg = GradientBoostingRegressor(n_estimators=119)  # best n_estimators found by GridSearchCV; other parameters left at their defaults
kfold = KFold(n_splits=15, shuffle=True, random_state=21)
cv_results = cross_val_score(reg, X_train,Y_train, cv=kfold, scoring='r2')
print(cv_results)
round(np.mean(cv_results)*100, 2)
reg.fit(X_train, Y_train)
score = reg.score(X_test,Y_test)
print("r2 score is {}".format(score*100))
pred = reg.predict(X_test)
error = sqrt(mean_squared_error(Y_test,pred))
print('RMSE is ' , error)
reg = KNeighborsRegressor()
kfold = KFold(n_splits=15, shuffle=True, random_state=21)
cv_results = cross_val_score(reg, X_train, Y_train, cv=kfold, scoring='r2')
print(cv_results)
round(np.mean(cv_results)*100, 2)
k_range = list(range(5,70))
print(k_range)
# create a parameter grid: map the parameter names to the values that should be searched
# value: list of values that should be searched for that parameter
# single key-value pair for param_grid
param_grid = dict(n_neighbors=k_range)
print(param_grid)
# instantiate the grid
grid = GridSearchCV(reg, param_grid, cv=kfold, scoring='r2')
# fit the grid with data
grid.fit(X_train,Y_train)
#finding the best estimator for the model
best_fit = grid.best_estimator_
best_fit
reg = KNeighborsRegressor(n_neighbors=23)  # best n_neighbors found by GridSearchCV; other parameters left at their defaults
kfold = KFold(n_splits=15, shuffle=True, random_state=21)
cv_results = cross_val_score(reg, X_train, Y_train, cv=kfold, scoring='r2')
print(cv_results)
round(np.mean(cv_results)*100, 2)
reg.fit(X_train, Y_train)
score = reg.score(X_test,Y_test)
print("r2 score is {}".format(score*100))
pred = reg.predict(X_test)
error = sqrt(mean_squared_error(Y_test,pred))
print('RMSE is ' , error)
reg = LinearRegression()
kfold = KFold(n_splits=15, shuffle=True, random_state=21)
cv_results = cross_val_score(reg, X_train, Y_train, cv=kfold, scoring='r2')
print(cv_results)
round(np.mean(cv_results)*100, 2)
reg.fit(X_train, Y_train)
score = reg.score(X_test,Y_test)
print("r2 score is {}".format(score*100))
pred = reg.predict(X_test)
error = sqrt(mean_squared_error(Y_test,pred))
print('RMSE is ' , error)
regr = RandomForestRegressor()
cv_results = cross_val_score(regr,X_train, Y_train, cv=kfold,scoring='r2' )
print(cv_results)
round(np.mean(cv_results)*100, 2)
k_range = list(range(5,70))
print(k_range)
# create a parameter grid: map the parameter names to the values that should be searched
# value: list of values that should be searched for that parameter
# single key-value pair for param_grid
param_grid = dict(n_estimators=k_range)
print(param_grid)
# instantiate the grid
grid = GridSearchCV(regr, param_grid, cv=kfold, scoring='r2')
# fit the grid with data
grid.fit(X_train,Y_train)
#finding the best estimator for the model
best_fit = grid.best_estimator_
best_fit
regr = RandomForestRegressor(n_estimators=62)  # best n_estimators found by GridSearchCV; other parameters left at their defaults
cv_results = cross_val_score(regr,X_train, Y_train, cv=kfold,scoring='r2' )
print(cv_results)
round(np.mean(cv_results)*100, 2)
regr.fit(X_train, Y_train)
score = regr.score(X_test,Y_test)
print("r2 score is {}".format(score*100))
pred = regr.predict(X_test)
error = sqrt(mean_squared_error(Y_test,pred))
print('RMSE is ' , error)
regressor = DecisionTreeRegressor()
kfold = KFold(n_splits=15, shuffle=True, random_state=21)
cv_results = cross_val_score(regressor,X_train, Y_train, cv=kfold,scoring='r2' )
print(cv_results)
round(np.mean(cv_results)*100, 2)
k_range = list(range(5,30))
print(k_range)
# create a parameter grid: map the parameter names to the values that should be searched
# value: list of values that should be searched for that parameter
# single key-value pair for param_grid
param_grid = dict(max_depth=k_range)
print(param_grid)
# instantiate the grid
grid = GridSearchCV(regressor, param_grid, cv=kfold, scoring='r2')
# fit the grid with data
grid.fit(X_train,Y_train)
#finding the best estimator for the model
best_fit = grid.best_estimator_
best_fit
regressor = DecisionTreeRegressor(max_depth=10)  # best max_depth found by GridSearchCV; other parameters left at their defaults
kfold = KFold(n_splits=15, shuffle=True, random_state=21)
cv_results = cross_val_score(regressor,X_train, Y_train, cv=kfold,scoring='r2' )
print(cv_results)
round(np.mean(cv_results)*100, 2)
regressor.fit(X_train, Y_train)
scores = regressor.score(X_test,Y_test)
print("r2 score is {}".format(scores*100))
pred = regressor.predict(X_test)
error = sqrt(mean_squared_error(Y_test,pred))
print('RMSE is ' , error)
from sklearn.feature_selection import SelectFromModel
sel = SelectFromModel(GradientBoostingRegressor(n_estimators=119))  # same tuned model as above; other parameters at their defaults
sel.fit(X_train,Y_train)
sel.get_support()
sel
feat_labels = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','waterfront','view','condition','grade','lat','long', 'sqft_living15', 'sqft_lot15', 'house_age']
data.columns
# Create a gradient boosting regressor
reg = GradientBoostingRegressor(n_estimators=119)  # best n_estimators found by GridSearchCV; other parameters left at their defaults
# Train the regressor
reg.fit(X_train, Y_train)
# Print the name and impurity-based importance of each feature
for feature in zip(feat_labels, reg.feature_importances_):
    print(feature)
# Create a selector object that will use the gradient boosting model to identify
# features that have an importance of more than 0.125
sfm = SelectFromModel(reg, threshold=0.125)
sfm.fit(X_train, Y_train)
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])
I converted my data to the appropriate data types and then processed it. I also did some data cleansing so that my features explain the data better.
Firstly, I dropped the id column, converted the date to datetime format, and converted the data types of price, bathrooms, and floors to integer. I created a new variable called house_age by subtracting the year the house was built from the year it was sold. Then I divided the price of houses in Redmond and Seattle by 2 to account for the housing-price inflation in the area indirectly caused by the tech companies. I dropped sqft_above and sqft_basement as they carry roughly the same information. I filtered out any house priced above 6.45e+05, which is the 75th-percentile price, to remove outliers. I also removed any house with more than 11 bedrooms, as that is an outlier; most houses do not have that many. Finally, I took the log of the price to reduce the effect of outliers.
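The cleaning steps above can be sketched on a tiny synthetic frame (the column names follow the kc_house_data schema, but the values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame mimicking a few kc_house_data columns (values invented).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["20141013T000000", "20150312T000000", "20140604T000000"],
    "price": [221900.0, 538000.0, 900000.0],
    "bathrooms": [1.0, 2.25, 3.5],
    "floors": [1.0, 2.0, 1.5],
    "bedrooms": [3, 3, 4],
    "yr_built": [1955, 1951, 2001],
})

df = df.drop("id", axis=1)                                   # drop the identifier
df["date"] = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")  # parse sale date
df["price"] = df["price"].astype(int)                        # cast to integer
df["bathrooms"] = df["bathrooms"].astype(int)
df["floors"] = df["floors"].astype(int)
df["house_age"] = df["date"].dt.year - df["yr_built"]        # age of the house at sale
df = df.drop(["date", "yr_built"], axis=1)

# Keep only houses at or below the 75th-percentile price (6.45e5 in the full
# dataset) with at most 11 bedrooms, then log-transform price to tame skew.
filt = (df["price"] <= 6.45e5) & (df["bedrooms"] <= 11)
df = df.loc[filt]
df["price"] = np.log(df["price"])
print(df.shape)  # the 900000-dollar house is filtered out
```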
-No, I did not use any additional datasets.
-Yes, I tuned the learning algorithms with GridSearchCV, searching over a wide range of parameter values to find the optimal results. It took a while to loop through and find the best model (roughly 30 minutes with a good connection).
To evaluate the quality of my system, I used k-fold cross-validation with k = 15, a value I settled on through trial and error. In k-fold cross-validation (according to the documentation, which I will link in the research), you shuffle the dataset randomly and split it into k groups; for each unique group, you take that group as the hold-out (test) set and the remaining groups as the training set, fit a model on the training set, evaluate it on the test set, retain the evaluation score, and discard the model. The skill of the model is then summarised using the sample of evaluation scores. I also set random_state = 21 so that every time I train the model the random numbers are generated in the same order. I compare the actual test score with the cross-validation score to show how much they differ; the closer the test score is to the CV score, the better. I also took the root mean square error into consideration to show how much the predicted values differ from the actual values. All this is done to guard against overfitting and underfitting of the model.
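The evaluation procedure above can be sketched end-to-end on synthetic data (the data here is invented; only the KFold/cross_val_score setup mirrors the notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic near-linear regression data, just for illustration.
rng = np.random.RandomState(21)
X_demo = rng.rand(90, 3)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.randn(90)

# shuffle=True randomises the fold assignment; random_state=21 makes the
# shuffling repeatable across runs, so every training run sees the same folds.
kfold = KFold(n_splits=15, shuffle=True, random_state=21)
scores_demo = cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=kfold, scoring="r2")
print(len(scores_demo), round(scores_demo.mean() * 100, 2))
```

Each of the 15 folds serves once as the hold-out set, and the mean of the 15 R² scores summarises the model's skill.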
I do not know how to evaluate against a simple baseline. I tried creating a Kaggle competition, but I do not know how to.
I used the root mean square error, the square root of the average of the squared errors, which measures the average squared difference between the estimated values and the actual values. If the mean squared error is large, it means the predictions are dispersed widely around the actual values; a smaller MSE means the predictions lie close to the actual values, which indicates a better-fitted model. But if the RMSE is too low, it might indicate overfitting.
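The definition above can be checked directly against scikit-learn on a toy example (the true/predicted values here are invented):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy actual and predicted values, just to illustrate the RMSE definition.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE = sqrt(mean of squared residuals)
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_manual)  # sqrt((0.25 + 0.25 + 0 + 1) / 4) = sqrt(0.375)
```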
Yes, it is possible. My gradient boosting model says that sqft_living and latitude are the most important features.
King County is a county located in the U.S. state of Washington. Its population was 2,252,782 in the 2019 census estimate, making it the most populous county in Washington and the 12th most populous in the United States. The county seat is Seattle, also the state's most populous city. Cities in King County include Seattle and Redmond, which are home to some of the most popular tech companies in the world; this inflates the cost of living, mainly housing prices and basic necessities. Housing costs in Redmond are roughly twice those of the rest of the cities in Washington and three times those of the rest of the USA. Redmond is home to Microsoft and an Amazon AWS office, which could be responsible for the high cost of housing there, with workers competing for houses in the city so they can commute conveniently. Seattle is home to the Amazon HQ, which is responsible for the high cost of housing and living there; houses in Seattle are likewise roughly twice the price of those in the rest of Washington and three times those in the rest of the USA. In both cities, housing is the biggest factor in the cost-of-living difference.
Location, home size, usable space, and views are important factors in the price of housing. The bigger the home size and usable space, the higher the price of the house. With a beautiful view outside the house, the price naturally increases as well.
Economic factors not included in the dataset, such as inflation and interest rates, are also known to impact housing prices. When interest rates are low, it is easier to take a loan from a bank to make big purchases such as houses; as such, housing prices can vary considerably across different economic times.